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Learning joint intensity-depth sparse representations 

Ivana Tosic and Sarah Drewes 



Abstract 

This paper presents a method for learning overcomplete dictionaries composed of two modalities that describe 
a 3D scene: image intensity and scene depth. We propose a novel Joint Basis Pursuit (JBP) algorithm that finds 
related sparse features in two modalities using conic programming and integrate it into a two-step dictionary learning 
algorithm. JBP differs from related convex algorithms because it finds joint sparsity models with different atoms 
and different coefficient values for intensity and depth. This is crucial for recovering generative models where the 
same sparse underlying causes (3D features) give rise to different signals (intensity and depth). We give a theoretical 
bound for the sparse coefficient recovery error obtained by JBP, and show experimentally that JBP is far superior to 
the state of the art Group Lasso algorithm. When applied to the Middlebury depth-intensity database, our learning 
algorithm converges to a set of related features, such as pairs of depth and intensity edges or image textures and 
depth slants. Finally, we show that the learned dictionary and JBP achieve the state of the art depth inpainting 
performance on time-of-flight 3D data. 

Index Terms 

Sparse approximations, dictionary learning, hybrid image-depth sensors. 

I. Introduction 

Hybrid image-depth sensors have recently gained a lot of popularity in many vision applications. Time of flight 
cameras (TJ, (2| provide real-time depth maps at moderate spatial resolutions, aligned with the image data of the 
same scene. Microsoft Kinect (3) also provides real-time depth maps that can be registered with color data in 
order to provide 3D scene representation. Since captured images and depth data are caused by the presence of 
same objects in a 3D scene, they represent two modalities of the same phenomena and are thus correlated. This 
correlation can be advantageously used for denoising corrupted or inpainting missing information in captured depth 
maps. Such algorithms are of significant importance to technologies relying on image-depth sensors for 3D scene 
reconstruction or visualization (3j, (4), where depth maps are usually noisy, unreliable or of poor spatial resolution. 

Solving inverse problems such as denoising or inpainting usually involves using prior information about data. 
Sparse priors over coefficients in learned linear generative models have been recently applied to these problems with 
large success (5)-|[7j- A similar approach has been proposed for learning sparse models of depth only, showing state- 
of-the-art performance in depth map denoising and offering a general tool for improving existing depth estimation 
algorithms (8j. However, learning sparse generative models for joint representation of depth and intensity images 
has not been addressed yet. Learning such models from natural 3D data is of great importance for many applications 
involving 3D scene reconstruction, representation and compression. 

This paper proposes a method for learning joint depth and intensity generative models. Each of these two 
modalities is represented using overcomplete linear models, resulting in two sets of coefficients. These two sets 
are coupled via a set of hidden variables, where each variable multiplies exactly one coefficient in each modality. 
Consequently, imposing a sparse prior on this set of coupling variables results in a common sparse support of 
intensity and depth. Each of these hidden variables can be interpreted as presence of a depth-intensity feature pair 
arising from the same underlying cause in a 3D scene. To infer these hidden variables under a sparse prior, we 
propose a convex, second order cone program named Joint Basis Pursuit (JBP). Compared to Group Lasso (GL) [9|, 
which is commonly used for coupling sparse variables, JBP gives significantly smaller coefficient inference error. 
In addition, we derive theoretical bounds for the coefficient recovery error in JBP, exploiting the restricted isometry 
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property (RIP) fTU) of the model. Finally, we propose an intensity-depth dictionary learning algorithm based on the 
new model and JBP. We show its superiority to GL in model recovery experiments using synthetic data, as well as 
in inpainting experiments using real time-of-flight 3D data. 

Section [n| introduces the proposed intensity-depth generative model. Inference of its hidden variables is achieved 



via the new JBP algorithm presented in Section [III} while learning of model parameters is explained in Section [TV 
Section [V] gives relations of the proposed JBP to prior art. Experimental results are presented in Section [VI 



II. Intensity-depth generative model 

Before introducing the proposed intensity-depth generative model, let us set the notation rules. Throughout the rest 
of the paper, vectors are denoted with bold lower case letters and matrices with bold upper case letters. Letters /, D 
in superscripts refer to intensity and depth, respectively. Sets are represented with calligraphic fonts. Column-wise 
and row-wise concatenations of vectors a and b are denoted as [a b] and [a;b], respectively. 




Fig. 1. Graphical representation of the proposed intensity-depth generative model. 



Let us now explain the proposed joint depth-intensity generative model whose graphical representation is shown 
in Figure [I] Intensity image y 7 and depth image y D (in vectorized forms) are assumed to be sparse in dictionaries 
resp. <I> D , i.e. they are represented as linear combinations of dictionary elements: 



n 



(1) 



where vectors a and b have a small number of non-zero elements and r) 1 and r) D represent noise vectors. 1q is the 
set of indexes identifying the columns (i.e., atoms) of <& 7 and <fr D that participate in sparse representations of y 1 
and y D . Its cardinality is much smaller than the dictionary size, hence \Xq\ <^ where 1 = {1, 2, N} denotes 
the index set of all atoms. This means that each image can be represented as a combination of few, representative 
image features described by dictionary elements, modulated by their respective coefficients. Because depth and 
intensity features correspond to two modalities arising from the same 3D features, we model the coupling between 
coefficients and b{ through latent variables Xi as: 



en 



■ m\xi 



Vi el, 



(2) 



where the variables mj , mf represent the magnitudes of the sparse coefficients and X{ represent the activity of these 
coefficients. Ideally, these variables should be binary where represents the absence and 1 represents the presence 
of a pair of depth-intensity features. In that case J2i x i counts the number of non-zero pairs of depth-intensity 
features. However, inference of binary values represents a combinatorial optimization problem of high complexity 
which depends on dictionary properties and the permission of noise, cf. (TT). We relax the problem by allowing 
xi to attain continuous values between and 1, which has been proven to provide a very good approximation in a 
similar context, cf., e.g., fT2| , (I3j. 
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By assuming that the vector x = (xi, X2, xn) t is sparse, we assume that y 7 and y D are described by a small 
number of feature pairs (<pj , <f>f) that are either prominent in both modalities (both m 7 and mf are significant) 
or in only one modality (either m 7 or mf is significant). In these cases X{ is non-zero, which leads to non-zero 
values for either a\ or bi, or both. If x\ is zero or very close to zero, both ai and b{ are also zero or very small. 
Hence, the sparsity assumption on x enforces a compact description of both modalities. It uses simultaneously 
active coefficients in both modalities, or when such pairs cannot approximate both images, the model also allows 
only one coefficient within a pair to be non-zero. Therefore, the model represents intensity and depth using a small 
set of joint features and a small set of independent features. The main challenge is to simultaneously infer the 
latent variables x, m 7 = {m\,m\, ...,rajy) and m D = (raf^ra^, ...,rajy) under the sparsity assumption on x. 
In the next section we propose to solve this problem via convex optimization. 



III. Joint Basis Pursuit 
Let us re-write the intensity-depth generative model, including all latent variables, in matrix notation as: 



rD 



* 7 









D 



where M 7 diag(m 7 , m^, rajy) and Nl D diag^f^mf 



M 7 
M D 

D 



X + 



n 



D 



, m 



Suppose that we know the dictionaries 



<l> 7 and <J> D and we want to find joint sparse representations of intensity and depth, i.e., to infer latent variables 
x, m 7 , m D . To do this, we need to solve the following optimization problem: 



OPT1 : min^x^, where X{ E [0, 1], i = 1, ... 5 iV 



subject to: ||y 7 - * 7 M 7 x|| 2 < (e 7 ) 2 

|| y ^_^M D x|| 2 <(6 D ) 2 



<U D 



(3) 
(4) 
(5) 
(6) 



where e 7 , e D are allowed approximation errors and U 1 and U D are upper bounds on the magnitudes m 7 and m D . 
In practice, the values of these upper bounds can be chosen as arbitrarily high finite values. This optimization 
problem is hard to solve using the above formulation, since the first two constraints are non-convex due to the 
terms M 7 x and M D x which are bilinear in the variables x, m 7 and m D . To overcome this issue, we transform 
it into an equivalent problem by introducing the change of variables given by Eqs. ([2]) deriving: 

OPT2 : min^Xi, where X{ G [0,1], i = 1, N 



subject to: ||y 7 - $ 7 a|| 2 < (e 1 ) 2 
||y D -* D b|| 2 <(e D ) 2 

< U 1 X{ 
\h\ < U D x u 



(7) 
(8) 
(9) 
(10) 



which is a convex optimization problem with linear and quadratic constraints that can be solved efficiently, i.e., in 
polynomial time, using log-barrier algorithms, cf. fl4| , (15). A variety of free and commercial software packages 
are available like IBM ILOG CPLEX (16), that we use in our experiments. 

The problems (OPT1) and (OPT2) are indeed equivalent using the variable transformation in Eqs. ([2]) as follows. 

Lemma 1. For any optimal solution (x*,a*,b*) of (OPT2), x* is also an optimal solution to (OPT1) with 
corresponding matrices (M 1 )*, (M D )* according to Q. 

Also, any optimal solution (x*, (M 1 )*, (M D )*) of {OPT 1) defines an optimal solution (x*,a*,b*) to (OPT2) . 

Proof: For any (x*,a*,b*) and corresponding (M 1 )*, (M D )* that satisfy Eqs. ([2]), conditions ([7]) and ^ are 
equivalent to ([3]) and ([4]) by definition. Moreover, since x* is nonnegative, conditions ([9]) and ( fTO] ) are equivalent 
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to ([5]) and ([6]). Hence, any x* that is optimal for (OPT2) with corresponding (a*,b*) is optimal for (OPT1) with 
corresponding (M 1 )*, (M D )* and vice versa. ■ 
An immediate consequence of the form of the objective function and constraints in (OPT2) is that x* is chosen 
such that ([9]) and ( [T0| ) are both feasible and at least one of them is active. Formally, this is stated by the following 
lemma. 

Lemma 2. For any optimal solution (x*,a*,b*) of (OPT2), at least one of the constraints ([9]) and ( [TO] ) is active 
for each component i, hence we have 

** = max{^,p}, Vi = l,...,JV. (11) 
Proof: Otherwise it would be a contradiction to the optimality of x*. ■ 

In the following, we refer to the optimization problem (OPT2) as Joint Basis Pursuit (JBP), where x is the vector 
of joint (coupling) variables in the signal model. It is important to know the theoretical bounds on the norm of the 
difference between the solution (a*,b*) found by JBP and the true coefficients (a, b) of the model ([!]). 

Based on the non-coupled case that is treated in |TT), we develop bounds on the difference of the optimal solution 
of (OPT2) and a sparse signal to be recovered. For this purpose, we assume that the matrix 

* 7 
® D 



A :-- 



(12) 



satisfies the restricted isometry property with a constant 5s, which refers to the following characteristics of the linear 
system. Denote At, T C 1, n as the n x |T| submatrix obtained by extracting the columns of A corresponding 
to the indices in T. The S-restricted isometry constant 5s is then defined as: 



Definition 1. / [TO] / The S-restricted isometry constant 5s of A is the smallest quantity such that 

(1-S 3 )\\s\\l < ||A T s||! < (1 + S s )\\s\\ 2 2 (13) 

for all subsets T with \T\ < S and coefficient sequences (sj), j E T . 

When 5s « 1, this property requires that every set of columns with cardinality less than S approximately 
behaves like an orthonormal system. It can thus be related to the maximal value of the inner product between any 
two columns in the matrix A, usually called the coherence of the dictionary: 

fx = max \{<t>i,<t>j)\, (14) 

where (pi and (pj are two different atoms in the dictionary (i.e., two columns of A) and (•) denotes the inner product. 
With this definition, it can be easily shown that 5s = n(\T\ — 1) satisfies the RIP inequality ( [T3| ). 

Before we present the bound on the coefficient recovery error of JBP, let us first define some prerequisites. 
Assume we are given a pair of sparse signals (y 7 , y D ) as in Eq. ([!]), with sparse coefficients (a , b°), which satisfy 
constraints {7]) and ([8]). Let Tq be the support of x° which is at the same time the support of at least a or b° and 
contains the support of the other one or it coincides with the support of both. Without loss of generality, let us 
assume that 

||y 7 ||2 = ||y D || 2 =:/o, (15) 

which can be easily obtained by normalization. Assume also that the components of a and b° satisfy the bound 
constraint^] 

\*i\ </o, \b°i\ </o, Vi = l,...,JV, (16) 

i.e., in the remainder of the paper we assume the same bounds on and b { \ U 1 = U D = U = f . It is also useful 
in practice to select the approximation error e in terms of the fraction of the total signal energy, so we denote 
e 77/0, where < 77 < 1. 



1 Although the assumption in Eq. ( 16 1 does not hold in general, in practical applications using learned dictionaries we found that it is 
always satisfied. However, if one wants to use a bound that is always satisfied, one should choose U = fo/<J m in, where <j m in is the smallest 
of all singular values of <£ J and & D . 
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Let further a\ denote the scale between the smaller and larger coefficient for each index i within the sparse 
support set To, i.e.: 



a. 



0| 



ai = min{^, j-^}, Vi € T , (17) 

and let 7 denote: 



'.1 \ a i\ 



7 = 1 — minc^. (18) 

Parameter 7 describes the level of similarity between the sparse coefficients in the two signals, which is decreasing 
with higher similarity. In the trivial case when a® = for all i E Tq we have that 7 = 0. In all other cases 7 < 1. 
Due to the common sparse support of intensity and depth as assumed by the signal model, we usually have a® ^ 
and ^ 0, for all i E To, and hence 7 < 1. 

Let further x° denote an auxiliary vector that satisfies 

m a x{\a%\$\} = Ux° i , 

namely (x°,a°,6°) is a feasible solution to (OPT2), where x° is chosen such that ([9]) and ( [T0| ) are both feasible 
and (at least) one of them is active. 

Finally, let (x*, a*, 6*) be an optimal solution to (OPT2). Then we have the following worst case bound on the 
distance of these. 

Theorem 1. Let (a , 6°) and (a*, 6*) as defined above and choose U — fo with fo from ( p3] > and e 1 — e D — r]fo, 
where < 77 < 1. Then 



a ;b°]-[a*;b*]\\l< 



^L(C + 7 ) 2 + C 2 



fo (19) 



holds for a constant C that depends on the signal model parameter 7, the sparse support size |To|, approximation 
parameter 77 and where the M -restricted isometry property is satisfied for the linear system, cfi Defi [7] In particular, 
we have: 

c = 4 V VM + ~f\T \VTThi 

fM(l-S M+ \ To \)-y/\T \(l + S M )' 



The proof of this Theorem is given in Appendix [A] 

IV. Intensity-depth dictionary learning 

In the previous section we have shown how to find sparse coefficients in the joint depth-intensity generative 
model, assuming that the model parameters, i.e., dictionaries $ 7 and <I> D are given. Since we do not have those 
parameters in general, we propose to learn them from a large database of intensity-depth image examples. Dictionary 
learning for sparse approximation has been a topic of intensive research in the last couple of years. Almost all 
existing algorithms are based on Expectation-Maximization, i.e., they are iterative algorithms that consist of two 
steps: 1) inference of sparse coefficients for a large set of signal examples while keeping the dictionary parameters 
fixed, and 2) dictionary optimization to minimize the reconstruction error while keeping the coefficients fixed. We 
follow the same approach here, using JBP in the first step and then conjugate gradient in the second step. Once JBP 
finds the sparse coefficients (a*, b*) and the coupling variables x, optimization of $ 7 and <I> D becomes decoupled. 
Therefore, in the learning step we independently optimize the following objectives: 

(* 7 )* = min ||Y 7 - * 7 A||^ + p||* 7 ||f (21) 
<i> 7 

= min ||Y D - * D B||^ + (22) 

where \\ • \\f denotes the Frobenius norm, Y 7 , Y D , A and B are matrices whose columns are y 7 , y 7 ^, a.j and bj 
respectively, and j = 1, J indexes the signal examples from a given database. In addition to the reconstruction 
error, we have added a normalization constraint on the dictionaries, scaled by a small parameter p, in order to control 
the dictionary norms as usually done in dictionary learning. Before showing the performance of the proposed learning 



algorithm, we review prior art that we will use for experimental comparisons in Section VI 
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V. Relation to prior art 

To the best of our knowledge, there has not been any work that addresses the problem of learning joint intensity- 
depth sparse representations. Therefore, we overview prior work that focuses on sparse approximation algorithms that 
bear similarities to the proposed JBP algorithm. Since the main characteristic of JBP is to find sparse approximations 
of two signals sharing a common sparse support, we overview algorithms targeting this problem. Such algorithms can 
be grouped into two categories with respect to the signal model they address: a) simultaneous sparse approximation 
algorithms, and b) group sparse approximation algorithms. We further discuss how algorithms from each group 
relate to JBP. 

Simultaneous sparse approximation algorithms recover a set of jointly sparse signals modeled as: 

y 5 = *x s + e s = ^2 4>i x s(i) + e 5 , s= 1, 5, (23) 

where S is the total number of signals y s , <fr is the dictionary matrix and e s is a noise vector for signal y 5 . Vectors 
of sparse coefficients x s share the same sparsity support set X, i.e., they have non-zero entries at the same positions. 
One of the earliest algorithms in this group is the Simultaneous Variable Selection (SVS) algorithm introduced by 



Turlach et. al. fT7| . SVS selects a common subset of atoms for a set of signals by minimizing the representation 
error while constraining the ^i-norm of the maximum absolute values of coefficients across signals. Formally, SVS 
solves the following problem: 

1 S 

(SVS): min-^||y s -*x s || 2 , (24) 

Z 8 = 1 

subject to: ^ max{|xi(i)|, ...,\xs(i)\} < r, (25) 

i 

where r is given. Let X denote the matrix with x s , s = 1, S as columns. We can see that the left hand side of the 
constraint in SVS is obtained by applying the ^-norm to rows (to find the largest coefficients for all explanatory 
variables), followed by applying the ^i-norm to the obtained vector in order to promote sparsity of the support. We 
denote this norm as HXHo^i. Versions of the same problem for the unconstrained case and the error-constrained 
case have been studied by Tropp (I8j. 

To see the relation of SVS to JBP, we use Lemma |2j which allows us to formulate the JBP for the special case 
of U 1 = U D as: 

min : t (26) 
subject to: ||y D - * D a|| 2 < e 2 (27) 
||y 7 - * 7 b|| 2 < e 2 (28) 



^max{|oi|,|&i|} <t. (29) 



Therefore, JBP operates on the same ^o^i-norm of the coefficient matrix as SVS. However, in contrast to SVS, 
JBP minimizes the number of non-zero elements in both a and b by minimizing ||[a bJHo^i and constraining the 
approximation error induced by the coefficients. A much more important difference of our work and fT7| is that 
we allow for different sets of atoms for intensity and depth. Thus, in JBP, each signal can be represented using 
a different dictionary, but with coefficient vectors that share the same positions of non-zero entries. This makes 
JBP applicable to intensity-depth learning, in contrast to SVS. Finally, we remark here that choosing the objective 



function as we did allows for a smooth convex representation of the last constraint ( |29| ). 

Group sparse approximation algorithms recover a signal modeled as: 

y = ^Hpq + e, (30) 

i 

where is a submatrix of a big dictionary matrix H. This model is useful for signals whose sparse support has a 
group structure, namely when groups of coefficients are either all non-zero or all zero. The first algorithm proposed 
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for group sparse approximation was a generalization of Lasso, developed by Bakin (9j, and later studied by other 
authors (e.g. Yuan and Lin fT9|). Group Lasso refers to the following optimization problem: 

(GL) : min Hy-^Hpqf + A^Hxillp, (31) 

i i 

where || • || p denotes the ^-norm. The most studied variant of group lasso is for p = 2, because it leads to a 
convex optimization problem with efficient implementations. The group sparsity model can be used to represent 
intensity-depth signals by considering pairs (a^,^),i = 1,...,7V as groups. In this case, group lasso with p = 2 
becomes: 

(GL-ID) : min ||y 7 - <ti«if + lly D " J2 + A E V ' a i + 6 * 

i i i 

The drawback of GL with p = 2 is that it penalizes independent components (pairs with only one large coefficient) 
over the more balanced ones (pairs with similar coefficients). Choosing p = oc avoids this problem and allows 
selection of pairs with unbalanced coefficients. In that case the regularizer penalizes the norm ||[a b]||oo 5 i. Rather 
than solving the unconstrained problem of group lasso with p = oc and a non-smooth objective, JBP reaches a 
similar goal by solving a constrained convex optimization problem with smooth constraints. It also eliminates the 
need for tuning the Lagrange multiplier. 

VI. Experimental results 

We have performed two sets of experiments in order to evaluate the proposed JBP and dictionary learning based 
on JBP. The first set of experiments uses simulated random data, with the goal to determine the model recovery 
performance of JBP when the ground truth signal models are given. In the second set, we apply JBP and dictionary 
learning on real depth-intensity data and show its performance on a depth inpainting task. In both cases, JBP has 
been compared to Group Lasso (GL). 



A. Model recovery 

To evaluate the performance of JBP, we have generated multiple sets of pairs of signals of size N = 64, denoted 
by {y 1 -} and {yf}, j = 1, 500. Signals in each pair have a common sparsity support of size |Tb|, and they are sparse 
in random, Gaussian iid dictionaries $ 7 and <f> D of size 64 x 128. Their coefficients, {aj} and {bj}, j = 1,500 
are random, uniformly distributed, and do not have the same values nor signs. However, their ratios ai (as defined 
in Eq. [17]) are bounded from below, which gives a certain value of 7 (see Eq. [18]) for each set. Hence, we assume 
some similarity in the magnitudes within each pair of coefficients of the two modalities. All signals have been 
corrupted by Gaussian noise. 

Figure [2] shows the relative coefficient reconstruction error ||a* — a l 1 2/ 1 1 a ll 2 "I - ll^* ~~ ^ Hi/ ll^ 3 Hi' where (a*,b*) 
are the reconstructions of original values (a, b). The error is averaged over 50 different signals and plotted versus 
the signal-to-noise (SNR) ratio between sparse signals and Gaussian noise. Each of the four graphs is obtained for 
different values of parameters |Tb| and 7. 

We have compared JBP with GL and with the theoretical bounds derived in Section III for M = 25 and M = 64. 
Instead of using the dictionary coherence value for 5, which would give the worst-case bounds, we use the mean 
of inner products between all atoms to obtain and plot the average case bounds. We can see that JBP always 
outperforms GL for a large margin. Moreover, the actual performance of JBP is much better than predicted by the 
theory, showing that the average derived bound is rather conservative. 

Furthermore, we have used these randomly generated signals as training sets in our dictionary learning algorithm, 
in order to recover the original dictionary. For four different values of sparsity |To| = 2,4,6,8, we have applied 
the proposed learning algorithm starting from a random initial dictionary. For comparison, we have replaced the 
JBP in the inference step with GL, while keeping the learning step exactly the same. We refer to this method as 
GL-based learning. Figure |3ja) shows the mean square error (MSE) between the original atoms and the recovered 
ones vs sparsity |Tq|, for JBP and GL-based learning and two values of 7: 0.1 and 0.25. Similarly, we plot in 
Figure [3jb) the percentage of recovered atoms vs sparsity, where an atom is considered recovered when its MSE 
is less than 0.05. Below this threshold the comparison is impossible since GL recovery error is huge (almost 
recovered atoms). We can see from both graphs that learning based on JBP is far superior to GL and that JBP is 
less sensitive to values of 7 than GL-based learning. 
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Fig. 2. JBP model recovery performance for random signals. Average relative coefficient reconstruction error is plotted for different signal- 
to-noise (SNR) ratios between sparse signals and Gaussian noise. Different graphs are for different parameter values: (a) |To| = 5,7 = 0.1, 
(b) |T | = 5,7 = 0.25, (c) |T | = 10, 7 = 0.1, (d) |T | = 10, 7 = 0.25. 
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Fig. 3. Recovery performance of dictionary learning using JBP and comparison to GL. (a) Mean square error between recovered atoms 
and original atoms vs sparsity |Tq|, for different values of 7. (b) Percentage of recovered atoms vs sparsity |To|. 
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B. Intensity-depth dictionary learning 

In our second set of experiments we have evaluated the performance of JBP and dictionary learning on real 
data, in particular on depth-intensity images. We have learned a depth-intensity overcomplete dictionary on the 
Middlebury 2006 benchmark depth-intensity data (20). The intensity data has been whitened, i.e., its frequency 
spectrum has been flattened, as initially proposed in (5). Such pre-processing speeds up the learning. Depth data 
could not be whitened because it would introduce Gibbs artifacts around the missing regions at occlusions. We 
handle such missing pixels by masking. Learning has been performed in a patch-mode. Namely, in each iteration of 
the two-step learning process, a large number of depth-intensity pairs of 12 x 12 patches have been randomly selected 
from data. Each depth and intensity patch within a pair coincide in a 3D scene. Patches have been normalized to 
have norm one, and 77 has been set to 0.1. We have chosen this value such that we get a good reconstruction of 
depth, without the quantization effects present in Middlebury depth maps (i.e., such that the quantization error is 
subsumed by the reconstruction error). We have learned dictionaries <f> J and <J> D , each of size 144 x 288, i.e., twice 
overcomplete. For comparison, we have also learned depth-intensity dictionaries using GL-based learning, where 
A = 0.3 has been chosen such that we obtain in average the same reconstruction error as in JBP. 

Figures |4ja) and (b) show dictionaries learned by JBP and GL, respectively. The JBP-learned dictionary contains 
more meaningful features, such as coinciding depth-intensity edges, while GL-learned dictionary only has few of 
those. JBP dictionary atoms also exhibit correlation between orientations of the Gabor-like intensity atoms and the 
gradient angle of depth atoms. This is quite visible in the scatter plots of intensity orientation vs depth gradient 
angle shown in Figure |5j We can see that for JBP there is significant clustering around the diagonal (corresponding 
to a 90° angle between orientation and gradient), while there is no such effect for GL. 

Finally, we have compared the performance of JBP and GL, and the corresponding learned dictionaries, on an 
inpainting task. Namely, we have randomly removed 95% of depth pixels from an intensity-depth pair obtained by 
a time-of-flight (TOF) camera^] We have chosen the TOF data to show that learned dictionaries of intensity-depth 
are not linked to particular depth sensors. From the original intensity image and 5% of depth pixels, we have 
reconstructed the whole depth image, using: 1) JBP and JBP-learned dictionaries (<&{, 2) GL and GL-learned 
dictionaries (^l^,*^), and 3) GL and JBP-learned dictionaries (<l>{, ^f). We can see from the Figure [6] that 1) 
gives the best performance, followed by 3) and 2). Thus, both the JBP dictionary and the JBP inference result 
in much better performance compared to GL. We do not present comparisons with methods based on depth only 
(e.g. |8|) since their performance is inferior as they do not use the intensity information. 

VII. Conclusion 

We have presented an algorithm to learn joint overcomplete dictionaries of image intensity and depth. The 
proposed method is based on a novel second order cone program for recovering sparse signals of joint sparse support 
in dictionaries with two modalities, named JBP. We have derived a theoretical bound for the coefficient recovery 
error of JBP and shown the superiority of JBP to the Group Lasso algorithm through numerical simulations. When 
applied to the Middlebury image-depth database, the proposed learning algorithm converges to an overcomplete 
dictionary of interesting intensity-depth features, such as coinciding edges and image grating - depth slant pairs. 
JBP and this learned dictionary yield state of the art performance in depth inpainting. Both JBP and our learned 
dictionary are of important value to 3D technologies based on hybrid image-depth sensors. 

Appendix 

A. Proof of Theorem [7] 

Before proving this theorem, let us prove the following lemma. 

Lemma 3. For 

h:= [a*;b*]-[a°;b°] 

it holds true that: 

llhgyHi < ||h ro ||i+7tf-|To|, (32) 

1 http://www.pmdtec.com/ 
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(b) 



Fig. 4. Learned intensity-depth dictionaries. Each column contains a set of atom pairs (</>{, (f) D ), where the left part is an intensity atom 
and the right part is a depth atom, (a) JBP-learned dictionaries <!>?), (b) GL-learned dictionaries (4>2, ^?)- 
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gradient angle for depth atoms [deg] 



orientation angle of image at 



(b) 



(c) 



Fig. 5. Correlation between depth atom gradients and image atom orientations, a) Illustration of atom pairs that have 90 degrees angle 
between the orientation of the Gabor-like intensity part and the gradient angle of the depth part. Scatter plots of orientation vs gradient angle 
for b) JBP and c) GL. 
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(a) 



(b) 



(c) 



(d) GL 25.7 dB 



(e) JBP 27.8dB (f) GL-*i 26.6dB 



Fig. 6. Inpainting results on time of flight data, a) Original intensity image, b) Original depth image, c) 5% of kept depth pixels, d) 
reconstructed depth with GL and GL-learned dictionaries (<I>2, ^?)» e ) reconstructed depth with JBP and JBP-learned dictionaries (<I>i, <&f), 
f) reconstructed depth with GL and JBP-learned dictionaries (<I>2, $2). 
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where Tq denotes the complement set of Tq and \it denotes the subvector of h corresponding to T. 
Proof: Define 

X^ :={i€l: |a°| = U Xi }, 
l° b :={zel\l a :\b i \ = Ux z }, 
1* :={i GX: \a*\ = U Xi *}, 
T b :={i<El\T a :\b*\ = U Xi *}. 



Due to Lemma |5J we have that ij U = X and I* U I 5 * = X, and due to the definition above it holds that 
jo H X° = an d J* n X* = 0. Therefore, we have that: 



Similarly, we have that: 



[a*;b*]||i = J2 Kl + E \ b *\ + E + E \°i 

i£Z* i£%b i^T* iGXJ 

<£/J>*| + t/J>*| + £/£ 

iGX *^X* *£X£ 

= 2C/||x*||i. 



iGXQ ieZj iEZj 



>GD 2E/||x°||i -jU\T \. 



Due to optimality of x*, we have ||x*||i < ||x°||i, which combined with (33) and (34) gives: 

||[a*;b*]|| 1 <2C/||x || 1 <||[a ;b ]|| 1 + 7 C/|Xo|. 

Due to a^ c and h% c 0, we can write 

||[a ;b ]+h||i = ||[a° ;bO o ;0] + [h To ;h T o]|| 1 

= \\[*T i h T ] + *-T h + l|h r( flli- 
Thus, using the triangle inequality and the definition of h we derive: 

llta^b ]!)! - ||h ro ||i + Hh^lli < ||[a°;b°] + h||! = ||[a*;b*]||i < S II [a ; b°] ||i + jU\T \ 

and thus 

||h r c||i < ||h To ||i + 7C/|Xb|. 



(33) 



(34) 
(35) 

(36) 



(37) 



We are now ready to prove Theorem [T] 

Proof: Let A be defined as in Eq. ( fT2| ). Then we have from ([7]) and ([8]) that 

||Ah|| 2 < 4e = 4rif . 

Assume we have divided Tq into subsets of size M, more precisely, we have Tq = T\ U • • • U T n _| To |, where Tj 
are sorted by decreasing order of h T c, and where Tqi — Tq\JT\. Without alternations - cf. pT| - it holds true that 



Using < f37] > yields now 



HVJli ^ W h T c\\i/M. 

\h T c\\ 2 2 <(\\h To \\ 1 + 1 U\T \) 2 /M 

< (|ro|||h To || 2 + 7f/|Xb|) 2 /M, 



(38) 
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where the second step follows from the norm inequality. Hence: 

|2 _ Mi — ||2 , \\u „l|2 



h 1 1 2 = ||hy 01 1| 2 + ||hj-c || 2 

<n i ™ 2 )\\h 01 \\* i 27 ^ r °l 2 Hh ii i ^I Tq D 2 ™ 

< (1 + - ir )||h To || 2 + — ||h To || 2 + — . (39) 



From the restricted isometry hypothesis, cf. Def. [T] we get 

||Ah|| 2 = ||A Toi h Toi +^A T .h T .|| 2 

> ||A Tol h Toi || 2 - || ^ A T .h T J| 2 

> ||A Toi h Toi || 2 - ^ ||A Tj .h Tj .||2 

J>2 



> y 1 - ^M+|T |ll h Toi||2 - y 7 1 + £m ^ llhrjb 

j>2 

> - 5 M+|To|ll h T ||2 - V^mJ] H h ^ll2 (40) 

J>2 

where 5# is a constant chosen such that the inequalities hold, which follows from inequality (4) in (TT). Here, A^ 
denotes the columns of A corresponding to the index set T. 



In analogy to |11|, due to the ordering of the sets Tj by decreasing order of coefficients, we have: 

|h Tj+1 (t)| < ||h T Ji/M 

meaning each component in h^ +1 is smaller than the average of the components in (absolute value-wise). 
Thus, we get: 

l|hT, +1 |li= £ ||h* 

teT j+1 

T 2 



< J2 W^Wi/M 2 

ter J+1 

< M\\h T .Wl/M 2 = \\h T .||2/M, 



and 



£||h T J 2 < £||h T .||i/\/M 

j>2 j>l 

||h T c||i/\/M 
<^(||h To ||i+7^|To|)/>/Af 

< ^/\To\/M\\h To \\ 2 + jU\T \/VM (41) 
where the last step follows from the norm inequality. Combining Eq. ([41]) and Eq. ( |40| ), we get: 

Ph|| 2 >^l-5 M+ | To |||h To || 2 - ^1 + W|ro|/M||h T J 2 - jU\T \Vl + 8m/Vm (42) 
and subsequently: 



„ ^ \\Ah\\ 2 +jU\T \VT+5^/VM 

\u.Tn 2 ^ — ^ . — / = (4J ) 

< 47 ? / VM + 7/o|To|VrT^ 



'M(l - 5 M+ \ Ta \) - ^\T \(1 + 8 M ) 
= Cf , (44) 



13 



if the denominator is greater than zero. Replacing this result in Eq. ( [39] ) and taking U = fo we get: 

NVil|2 n i l^Ql 2 ^2 r2 i o^ J^Q| 2 ^f^2 i ^2 l^oj 2 ^2 

Il h ll2 < ( 1 + ^~) C +27 ^~ C ^o +7 -Jj-JO' ( 45 ) 



which is equivalent to (19) and thus completes the proof. 
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